SmallDoges

arXiv huggingface License: Apache-2.0

SmallDoges is under construction; let's develop it together!

English | 简体中文

About


As shown in the figure below, the sequence-transformation part of the Doge architecture uses Dynamic Mask Attention, which can be understood as self-attention related to the value states during training, and as a state space without past-state decay during inference, to solve the problem of existing Transformers or SSMs getting lost in long text. The state-transformation part of Doge uses a Cross Domain Mixture of Experts, which consists of dense linear layers and sparse embedding layers; sparse parameters can be added on top of a dense weight checkpoint and trained further without retraining the entire model, reducing the cost of continuously iterating on the model. In addition, Doge uses RMSNorm and residual connections with learnable parameters to adapt the gradient range of deep models.
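To make the "self-attention related to value states" idea concrete, here is a minimal conceptual sketch in NumPy: a causal attention step whose mask logits get an additive, per-position score derived from the value states. This is only an illustration of the concept; the projection `w`, the shapes, and the way the dynamic score enters the mask are assumptions, not the actual Doge implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dynamic_mask_attention(Q, K, V, w):
    """Conceptual sketch of value-state-dependent masked attention.

    Q, K, V: (T, d) query/key/value states for one head.
    w: (d,) hypothetical learned projection that scores each position
       from its value state. The real Doge module differs in detail.
    """
    T, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)          # standard scaled dot-product
    dyn = V @ w                            # per-position score from value states
    causal = np.tril(np.ones((T, T), dtype=bool))
    # Dynamic mask: add the value-derived score to every visible position,
    # keep future positions masked out.
    scores = np.where(causal, scores + dyn[None, :], -np.inf)
    return softmax(scores) @ V
```

Because of the causal mask, the first output position attends only to itself, so it simply returns `V[0]`; later positions mix value states with weights shifted by the value-derived scores.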

Dynamic Mask Attention Module

DMAttn

Cross Domain Mixture of Experts Module

CDMoE

We also hope to use open-source tools and frameworks as much as possible to simplify the process from data processing to model training, so that beginners can easily understand and use them.

Requirements

We highly recommend that you install the latest version of PyTorch and CUDA for optimal performance.

Alternatively, you can use the open-source PyTorch Docker image to avoid configuring the environment yourself.

docker pull nvcr.io/nvidia/pytorch:24.12-py3
docker run --privileged --gpus all -it --name PyTorch --shm-size=32g -p 8888:8888 -p 6006:6006 --ulimit memlock=-1 --ulimit stack=67108864 -v <your code path>:/workspace -v <your datasets path>:/workspace/Doge/datasets nvcr.io/nvidia/pytorch:24.12-py3

Installation

git clone https://github.com/SamllDoge/SmallDoges.git
cd SmallDoges
pip install -e .

Usage

We have written a notebook (still being updated) to demonstrate the entire process of dataset processing, model training, and model evaluation. You can use the complete architecture or individual modules.

Models Released

Doge-CheckPoint

wsd_scheduler

Doge uses wsd_scheduler as the training scheduler, which divides the learning rate into three stages: warmup, stable, and decay. It allows us to resume training on any new dataset from any checkpoint in the stable stage without training spikes.

Here are the initial learning rates required to continue training at each checkpoint:

| Model | Learning Rate | Schedule | Warmup Steps | Stable Steps |
|---|---|---|---|---|
| Doge-20M | 8e-3 | wsd_scheduler | 800 | 6400 |
| Doge-60M | 6e-3 | wsd_scheduler | 1600 | 12800 |
| Doge-160M | 4e-3 | wsd_scheduler | 2400 | 19200 |
| Doge-320M | 2e-3 | wsd_scheduler | 3200 | 25600 |
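The three-stage schedule can be sketched as a plain function of the step count. This is a minimal sketch: the warmup is linear and the plateau is flat as described above, but the exact decay shape (cosine here) and the `decay_steps` parameter are assumptions, not necessarily what wsd_scheduler implements.

```python
import math

def wsd_lr(step, peak_lr, warmup_steps, stable_steps, decay_steps):
    """Warmup-Stable-Decay learning-rate sketch.

    Linear warmup to peak_lr, hold during the stable stage, then decay
    toward zero. Cosine decay is an assumption for illustration.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps              # linear warmup
    if step < warmup_steps + stable_steps:
        return peak_lr                                    # stable plateau
    t = min((step - warmup_steps - stable_steps) / decay_steps, 1.0)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * t)) # decay to 0
```

For the Doge-20M settings in the table (peak 8e-3, 800 warmup steps, 6400 stable steps), step 400 sits halfway through warmup at 4e-3, and any step in [800, 7200) returns the full 8e-3, which is why a checkpoint from the stable stage can resume on new data at the same learning rate.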

Doge-SLM

Pre-Training:

| Model | Training Data | Steps | Content Length | Tokens | LR | Batch Size | Precision |
|---|---|---|---|---|---|---|---|
| Doge-20M | HuggingFaceTB/smollm-corpus | 8k | 2048 | 4B | 8e-3 | 0.5M | bfloat16 |
| Doge-60M | HuggingFaceTB/smollm-corpus | 16k | 2048 | 16B | 6e-3 | 1M | bfloat16 |
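The Tokens column follows directly from Steps × Batch Size, since the batch size here is measured in tokens; a quick check:

```python
# Total pre-training tokens = steps * batch size (batch size in tokens).
runs = {
    "Doge-20M": (8_000, 0.5e6),   # 8k steps, 0.5M tokens per batch
    "Doge-60M": (16_000, 1.0e6),  # 16k steps, 1M tokens per batch
}
for name, (steps, batch_tokens) in runs.items():
    total = steps * batch_tokens
    print(name, f"{total / 1e9:.0f}B tokens")
# Doge-20M → 4B tokens, Doge-60M → 16B tokens, matching the table.
```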

Evaluation:

| Model | MMLU | TriviaQA | ARC-E | ARC-C | PIQA | HellaSwag | OBQA | Winogrande | tokens/s on CPU |
|---|---|---|---|---|---|---|---|---|---|
| Doge-20M | 25.43 | 0.03 | 36.83 | 22.78 | 58.38 | 27.25 | 25.60 | 50.20 | 142 |
| Doge-60M | 26.41 | 0.18 | 50.46 | 25.34 | 61.43 | 31.45 | 28.00 | 50.75 | 62 |

All evaluations use five-shot settings, without any additional training on the benchmarks.

SFT:

| Model | Training Data | Epochs | Content Length | LR | Batch Size | Precision |
|---|---|---|---|---|---|---|
| Doge-20M-Instruct-SFT | HuggingFaceTB/smoltalk | 2 | 2048 | 8e-4 | 0.25M | bfloat16 |
| Doge-60M-Instruct-SFT | HuggingFaceTB/smoltalk | 2 | 2048 | 6e-4 | 0.25M | bfloat16 |

DPO:

| Model | Training Data | Epochs | Content Length | LR | Batch Size | Precision |
|---|---|---|---|---|---|---|
| Doge-20M-Instruct | HuggingFaceH4/ultrafeedback_binarized | 2 | 1024 | 8e-5 | 0.125M | bfloat16 |
| Doge-60M-Instruct | HuggingFaceH4/ultrafeedback_binarized | 2 | 1024 | 6e-5 | 0.125M | bfloat16 |

Environment:

Citation

If you use this codebase, or otherwise find our work valuable, please cite our paper:

@misc{shi2024wonderfulmatrices,
      title={Wonderful Matrices: Combining for a More Efficient and Effective Foundation Model Architecture}, 
      author={Jingze Shi and Bingheng Wu},
      year={2024},
      eprint={2412.11834},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2412.11834}, 
}